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Abstract 

• We present a high-performance JV-body code for self-gravitating collisional systems accelerated with the aid of a new SIMD 
instruction set extension of the x86 architecture: Advanced Vector extensions ( AVX) , an enhanced version of the Streaming SIMD 
'^Extensions (SSE). With one processor core of Intel Core i7-2600 processor (8MB cache and 3.40 GHz) based on Sandy Bridge 
Q micro-architecture, we implemented a fourth-order Hermite scheme with individual timestep scheme (Makino & Aarseth, 1992), 
S—j and achieved the performance of ~ 20 giga floating point number operations per second (GFLOPS) for double-precision accuracy, 
C/3 which is two times and five times higher than that of the previously developed code implemented with the SSE instructions 
C^f Nitadori et al., 2006b), and that of a code implemented without any explicit use of SIMD instructions with the same processor 
core, respectively. We have parallelized the code by using so-called NINJA scheme (Nitadori et al., 2006a), and achieved ~ 90 
^s^j GFLOPS for a system containing more than JV = 8192 particles with 8 MPI processes on four cores. We expect to achieve about 10 
^ iera FLOPS (TFLOPS) for a self-gravitating collisional system with JV ~ 10 5 on massively parallel systems with at most 800 cores 
{*""*) with Sandy Bridge micro-architecture. This performance will be comparable to that of Graphic Processing Unit (GPU) cluster 
systems, such as the one with about 200 Tesla C1070 GPUs (Spurzem et al., 2010). This paper offers an alternative to collisional 
C*** JV-body simulations with GRAPEs and GPUs. 
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J> 1. Introduction 
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, JV-body simulations arc a powerful tool to follow dynam- 
C^.ical evolution of self-gravitating many-body systems, such 
as planetary systems, star clusters, galaxies, galaxy clus- 
ters, and the large scale structure in the universe. In an JV- 
body simulation, we solve the following equation of motion 
of particles: 
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where G is the gravitational constant, e is the softening 
length, JV is the number of particles, 77ij and are re- 
spectively the mass and position of ith particle, and rtj = 
Tj—Ti. For self-gravitating collisional systems such as plan- 
etary systems and star clusters, one needs to keep suffi- 
ciently good accuracy and hence resort to a brute-force 
scheme, a direct scheme to compute gravitational forces. 
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In fact, the number of particles adopted in recent numeri- 
cal simulations of these systems is not always large enough. 
For example, the number of stars in a single globular clus- 
ter is typically ~ 10 6 , whereas the number of particles in 
recent numerical simulation of globular clusters is ~ 10 5 
at most. This is the reason why efficient ways to compute 
gravitational force for a given set of particles are intensively 
studied in many aspects. 

The use of external hardwares is one of the most pop- 
ular ways to compute gravitational force very quickly. 
GRAPEs (Ito et al, 1991; Makino et al., 1997, 2003), 
which are special purposed accelerators for JV-body simu- 
lations, have greatly contributed to JV-body simulations of 
collisional systems. Recently, Nitadori (2009) and Gaburov 
et al. (2009) explored the capability of commodity Graph- 
ics Processing Units (GPUs) as hardware accelerators for 
JV-body simulation and achieved a performance close to 
that of GRAPE-DR (Makino, 2008), the latest version of 
the GRAPE family. 

Another approach to accelerate the computation of grav- 
itational forces is to utilize SIMD (Single Instruction, Mul- 
tiple Data) instructions, which can perform the same op- 



eration on multiple data simultaneously, implemented on 
recent central processing units (CPUs) instead of making 
use of external hardware accelerators. Since Pentium III 
released by Intel Corporation in 1999, CPUs with x86 ar- 
chitecture have supported a set of SIMD instructions called 
Streaming SIMD Extensions (SSE). Nitadori ct al. (2006b) 
(NMH06) applied the SSE instructions to the force calcu- 
lation in A-body simulations, and calculated four interac- 
tions among particles in parallel. NMH06 achieved about 
three times higher performance than the case without the 
use of the SSE instructions. Furthermore, Nitadori et al. 
(2006a) (NMA06) demonstrated that the performance of 
CPUs with the aid of the SSE instructions is comparable 
to GRAPEs and CPUs in massively parallel simulations. 

Recently, a new processor family with Sandy Bridge 
micro-architecture has been released by Intel Corporation. 
The processor supports a new set of instructions known 
as Advanced Vector extensions (AVX), an enhanced ver- 
sion of the SSE instructions. In the AVX instruction set, 
the width of the SIMD registers are extended from 128- 
bit to 256-bit. Hence, an AVX instruction set is able to 
process twice amount of data than the SSE instruction 
set. This suggests that a processor with the AVX instruc- 
tion set should be able to compute accelerations twice as 
fast compared to its SSE-only counterpart. The AVX in- 
structions are also supported by the next-generation CPU 
" Bulldozer" released by AMD Corporation. 

In this paper, we present a new TV-body code imple- 
mented using the AVX instructions. In our ./V-body code, 
we adopt a fourth-order Hermite scheme with individual 
timestep scheme (Makino & Aarseth, 1992) for its time in- 
tegration, which is widely used for collisional A-body sim- 
ulations. We have achieved 20 giga floating point number 
operations per second (GFLOPS) per processor core for 
A-body simulations. This performance is two times higher 
than the case in which we use the SSE instructions. 

This paper is organized as follows. In section 2, we out- 
line the algorithm of the fourth-order Hermite scheme. In 
section 3, we describe our implementation for the A-body 
code using the AVX instruction set. In section 4 and 5, we 
show an accuracy and performance of our A-body code, 
respectively. The results are summarized and discussed in 
section 6. 



2. Algorithm 

The fourth-order Hermite scheme is a kind of a predictor- 
corrector method, developed by Makino & Aarseth (1992). 
We outline its algorithm below. The predictor of ith particle 
is given by Taylor series as 
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where Ati, r^, V{, di, and j i are the timestep, position, ve- 
locity, acceleration, and its first-order time derivative (jerk) 
of ith particle, respectively, and the superscripts (0) and 
(p) indicate quantities at the current time and predicted 
quantities, respectively. Using the predicted position and 
velocity, r\ and v[ p \ the acceleration and jerk of ith par- 
ticle are evaluated as 
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and the superscript (1) indicates quantities at the 

next time. Additionally, the potential of ith particle, <p^\ 
given by 
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is computed for checking the validity of the energy conser- 
vation. 

The corrector is based on the third-order Hermite inter- 
polation constructed using the old acceleration and jerk, 
and the new ones (a[°\ jf \ , and respectively). 
From the third-order Hermite interpolation polynomial, we 
can obtain the second-order and third-order time deriva- 
tives of acceleration, which are so-called snap (s^) and 
crackle (c\ ), respectively. The formulas for the snap and 
crackle are expressed as 
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The corrections for the position and velocity are given by 
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Usually, individual timestep scheme is adopted in the 
Hermite scheme. The timestep of ith particle is given by 

^ — " I , .mi., mi, , (o) r I ' 



where -q is an accuracy parameter (Aarseth, 1985). Practi- 
cally, we employ a hierarchical timestep scheme (Makino, 
1991) for SIMD programing and parallclization; we dis- 
crctize the timcsteps in a power of two hierarchy, where all 
timesteps are shorter or equal to the timesteps obtained 



by equation (11), and are a power of two subdivision of a 
predefined global timcstep. 

A detailed procedure of the Hcrmite scheme is as follows. 

1. Select particles to be integrated (hereafter, i- 
particles), whose next times are nearest among all 
the particles. 

2. Predict the positions and velocities of i-particles. 

3. Predict the positions and velocities of particles which 
exert forces on z-particles (hereafter, j'-particles) . 

4. Calculate accelerations, jerks, and potentials of i- 
particles. 

5. Construct the Hcrmite interpolation for i-particlcs. 

6. Correct the positions and velocities of z-particles. 

7. Update and j\ by substituting and j^, 
respectively. 

8. Calculate the next timesteps of i-particlcs. 

9. Return to the step 1. 

The step 2 is a duplication of the step 3. This duplication 
has small overhead, and is convenient when we perform this 
scheme in parallel. 

3. Implementation 

We use the AVX instructions to implement force calcu- 
lation in the Hcrmite scheme, i.e. the step 4. A cost for the 
force calculation is order of TV 2 and that for the other parts 
is order of TV; thus the force calculation dominates the total 
calculation time. Below, we describe the implementation of 
the force calculation. 

3.1. The SSE and AVX instructions 

Before we describe our implementation, let us briefly 
summarize the SSE and AVX instructions and the differ- 
ence between them. 

The SSE instruction set is a SIMD instruction set intro- 
duced for the first time in Pentium III processors to improve 
the performance of media streaming, image processing and 
three-dimensional graphics by executing 4 single-precision 
(SP) or 2 double-precision (DP) floating-point operations 
in parallel. Operations supported by the SSE instruction 
set include addition, subtraction, multiplication, division, 
square root, inverse square root, etc. In such operations, 
dedicated registers with 128- bit length called "XMM reg- 
isters" are used to store the 4 SP and 2 DP floating-point 
numbers. Although operations between data in the XMM 
registers and in main memory are supported, those with all 
data stored in the XMM registers are faster. The available 
number of the XMM registers in a single processor core is 
8 for x86 processors and 16 for x86_64 processors. NMH06 
utilized the SSE instruction set to accelerate the force cal- 
culations in their collisional A-body code. 

The AVX instruction set is an enhanced version of the 
SSE instruction set, and there are two major differences 
between them. One is the number of data on which an in- 
struction is operated in parallel. The length of the dedi- 



cated registers for the AVX instructions, "YMM registers" , 
is 256 bits, two times longer than that of the XMM regis- 
ters. Thus, 8 SP or 4 DP floating-point operations can be 
carried out simultaneously by the AVX instructions, while 
the SSE instructions can execute 4 SP floating-point op- 
erations or 2 DP floating-point operations in parallel. The 
lower 128 bits of the YMM registers are regarded as the 
XMM registers used by the SSE instructions for backward 
compatibility. The number of the YMM registers in one 
processor core is 16, equal to that of the XMM registers. 

The other is that the number of operands of most in- 
structions has been increased from two in the SSE instruc- 
tion to three operands in the AVX instructions. Two of the 
three operands are source operands, and the other one is 
a destination operand, where the result of the operation 
is stored. Owing to this, while one of source operands is 
used as a destination operand in the SSE instructions, and 
overwritten, we can preserve both source operands at each 
AVX instruction, and use a limited number of the YMM 
registers very efficiently. 

Methods to implement the code explicitly using the SSE 
and AVX instructions should be also addressed. In prin- 
ciple, if the compilers were clever enough to detect any 
concurrent loop, it could generate the code that effectively 
utilize these instructions. In reality, however, the present- 
day compilers cannot fully resolve the dependency among 
variables in the loops very well. Therefore, we have to im- 
plement the SSE and AVX instructions explicitly using 
assembly-languages or compiler-dependent intrinsic func- 
tions. Intrinsic functions are supported by several compil- 
ers and easy to program with. On the other hand, inline 
assembly-languages, we can manually control the assign- 
ment of the XMM and YMM registers to computational 
data, and minimize the access to the memory or the cache 
memory by optimizing the use of individual registers. As 
already noted above, calculations using SSE and AVX in- 
structions are very efficient if all the data are stored in the 
dedicated registers. Therefore, the optimization of the use 
of the XMM and YMM registers is of crucial importance. 
In this work, we adopt an implementation of the SSE and 
AVX instructions using inline-assembly embedded in C- 
language. In the followings, we only present implementa- 
tions and results for x86_64 processors with 16 XMM and 
YMM registers. 

3.2. Arithmetic precision 

In our implementation, we perform the force calculation 
in the "mixed" precision, in which, as is done in NMH06, 
only the relative position vector between i- and j-particles 
(j-yy in equation (4), (5), and (6)), and the accumulation 
of individual acceleration and gravitational potential (sum- 
mation in equation (4) and (6)) are computed in DP, and 
the remaining portions of the force calculation are done in 
SP. 

In computing accelerations, jerks, and potentials in equa- 



tions (4), (5) and (6), respectively, we calculate the inverse 
square root of (r-j') 2 + e 2 using the VRSQRTPS instruc- 
tion, which computes an approximated value of the inverse 
square root very quickly to an accuracy of 12 bits. To obtain 
an accuracy equivalent to SP, a Newton- Raphson iteration 
is applied, such that 

x\ = -^x Q (axl - 3), (12) 

where xq is an initial guess for 1/y/a, and x\ is improved 
value of 1/y/a. NMH06 reported that values returned by 
the VRSQRTPS instruction contain statistical bias depen- 
dent on implementation of this instruction. We statistically 
correct the bias in the same way as NMH06. 

3.3. SIMD parallelization of the force calculation using the 
AVX instruction set 

Since we perform the force calculation mostly in SP, we 
calculate eight interactions between i- and j-particles si- 
multaneously. In our implementation, interactions between 
four j-particles and two i-particles are calculated in par- 
allel. Of course, there exist other possible combinations of 
interactions such as four i-particles and two j-particlcs. We 
have compared the performance of various combinations of 
interactions and found that this choice exhibits the best 
performance among others by 5-10%. 

The calculation of forces on two i-particles (a force loop) 
in our implementation consists of three parts as follows. 

1. Prepare the i- and j-particle data in a suitable form 
for SIMD calculations. 

2. Calculate the forces on the two i-particles exerted by 
all j-particles on the YMM registers. 

3. Write back the calculated forces of the two i-particles 
on the YMM registers into the memory. 

The steps 1 and 3 are written in C language, while the step 
2 is written in C language with inline assembly language. 

First, we describe the implementation of the steps 1 and 
3. List 1 shows the definitions of structures for i- and j- 
particles as well as a structure for the results (accelera- 
tions, jerks, and potentials) used in our implementation. 
Before computing the accelerations, jerks, and potentials 
of i-particles, the positions, velocities, and indices of i- 
particles are stored in an array of the structure, Ipart, and 
the positions, velocities, masses, and indices of j-particles 
are stored in an array of the structure Jpart. The calcu- 
lated accelerations, jerks, and potentials are stored into an 
array of the structure NewAccJerk. 

List 1. Structures for i-particles, j-particles and the resulting accel- 
e rations, 

1 struct Ipart{ 

2 double xpos [8] ; // (iO, iO , iO , iO , il, il, il, il) 

3 double ypos[8] ; 

4 double zpos [8] ; 

5 float id [8] ; 

6 float xvel [8] ; 

7 float yvel [8] ; 

8 float zvel [8] ; 



9 >; 

10 struct Jpart{ 

11 double xpos [4]; // (jO, jl, j2, j3) 

12 double ypos [4] ; 

13 double zpos [4] ; 

14 float id [8]; // (jO, jl, j2, j3, jO, jl, j2, j3) 

15 float mass [8] ; 

16 float xvel [8] ; 

17 float yvel [8] ; 

18 float zvel [8] ; 

19 }; 

20 struct NewAccJerk{ 

21 double xacc ; 

22 doubl e yac c ; 

23 double zacc ; 

24 doubl e pot ; 

25 float xjrk; 

26 float yjrk; 

27 float zjrk; 

28 }; 

Note that the size of each array in the structures of i- and 
j-particlcs is multiple of 256 bits, the length of the YMM 
registers, so that each array can be readily loaded onto the 
YMM registers. The structures of i-particles (ipart) and 
j-particles ( Jpart) contain the data of two i-particles and 
four j-particles, respectively. In figure 1, we show the as- 
signments of particle indices in the arrays in these struc- 
tures, where iO and il, and jO, jl, j2, and j3 are the indices 
of i- and j-particles, respectively Such assignments enable 
us to calculate the accelerations, jerks, and potentials of 
two i-particlcs from four j-particlcs in a efficient manner 
with the AVX instructions. In the step 1, we arrange the 
particle data and substitute them into the structures Ipart 
and Jpart before calculating the accelerations and jerks in 
the step 2. 

Next, we describe the implementation of the step 2, in 
which the accelerations, jerks, and potentials of i-particles 
are computed using the AVX instructions. In issuing the 
AVX instructions, we use the inline assembly language and 
manually control the assignment of the YMM registers to 
the computational data to achieve a good performance. We 
also try to obtain a high issue rate of the AVX instructions 
by optimizing the order of operations so that operands in 
adjacent instruction calls do not have dependencies as much 
as possible. 

For the readability of the source code, we introduce a 
set of preprocessor macros of the AVX instructions. The 
macros used in our implementation are shown in List 2. 
Most of them have three operands, in which srcl and src2 
are the source operands and the results of the instructions 
are stored in the last operand (destination operand, dst). In 
these macros, operands named sre, srcl, src2 and dst are 
in the XMM/YMM registers, and those named mem indicate 
the main memory or the cache memory. Brief descriptions 
of these macros are summarized in Table 1. More detailed 
documents on the AVX instructions can be found in Intel's 
website. 1 

List 2. Preprocessor macros for inline assembly codes. 

1 #define VZERQALL asm ( " vzeroall " ) ; 
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Table 2 

Aliases of the YMM registers and the assignment of variables to the 
registers. 



Alias 


ID 
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YMM00 


y.ymmO 
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#deiine VCMPNEQPS (srcl , src2 , dst) \ 
J asm ( " vcmpps u 7.0 , u 7. " sr c 1 ", u 7."src2 ", u 7."dst " u \ 

i "::"g"(4)); 

[ #define PREFETCH (mem) asm ( " pref et cut U 7.0 11 : : " m " (mem) ) 

Table 2 shows the assignment of the YMM registers to 
the physical quantities in the calculations of accelerations, 
jerks, and potentials, where the subscripts of x, y and z 
indicates x, y, and z components of vectors, respectively. 
Hereafter, for simplicity, we omit the superscripts (p) and 
(1) of r and v, and of a, j and 4>, respectively. With this as- 
signment of the YMM registers, we can carry out the com- 
putation using the data in only the YMM registers except 
for one load instruction for each j-particlc data. 

Figure 2 shows a schematic illustration of a force loop 
to calculate accelerations, jerks, and potentials on two i- 
particlcs from four j-particles by using the AVX instruc- 
tions. The variables with a subscript i and j in figure 2 con- 
tains the 8 SP data of i- and j-particles in the same order 
of indices shown in figure 1 , and those with a subscript ij in 
figure 2 such as and Vij contains the data with indices of 
i- and j-particles in the order shown in figure 3. In figure 2, 
the blocks DX, DY and DZ indicate the calculation of relative 
coordinates between two i-particlcs and four j-particlcs for 
x, y and z components, respectively, in the mixed precision, 
and the block SUM indicates the accumulation of physical 
quantities (accelerations and potentials) of two i-particles 
interacting with four j-particles in the DP. The details of 
the blocks DX, DY, DZ and SUM are described later. 

First we describe the procedure depicted in figure 2. 

1. Calculate r.y iX , ry ;J/ , and Ti^ z in DX, DY, and DZ (see 
figure 4), and store them into YMM03, YMM04, and 
YMM05, respectively. 



Table 1 

Descriptions of the macros for inline assembly codes, "sre", "srcl", "src2" and "dst" indicate the YMM registers and "mem" indicates the 
main memory or the cache memory. 



VZEROALL 


zero out all registers 


VLOADPS(mem,dst) 


load 8 SP data from mem into dst 


VLOADPD(mem,dst) 


load 4 DP data from mem into dst 


VSTORPS(src,mem) 


store 8 SP data in sre into mem 


VSTORPD(src,mem) 


store 4 DP data in sre into mem 


VADDPS ( sr c 1 , src2 , dst ) 


add 8 SP data in srcl and src2 and store the results to dst 


VADDPD ( sr c 1 , src2 , dst ) 


add 4 DP data in srcl and src2 and store the results to dst 


VSUBPS ( sr c 1 , src2 , dst ) 


subtract 8 SP data in srcl from 8 SP data in src2 and store the results to dst 


VSUBPS_M (mem , sr c , dst) 


subtract 8 SP data in mem from 8 SP data in sre and store the results to dst 


VSUBPD_M (mem , sr c , dst ) 


subtract 4 DP data in mem from 4 DP data in sre and store the results to dst 


VMULPS ( sr c 1 , src2 , dst ) 


multiply 8 SP data in srcl and src2 and store the results to dst 


VMULPS_M (mem , sr c , dst ) 


multiply 8 SP data in mem and sre and store the results to dst 


VHADDPD (sre 1 , sr c2 , dst ) 


add pairs of adjacent SP data in each of srcl and src2 and store the results to dst 


VRSQRTPS(src,dst) 


return the inverse square root of 8 SP data in sre and store the results to dst 


VCVTPD2PS(src,dst) 


convert 4 DP data in sre to 4 SP data and store the results to the lower 128 bits of dst 


VCVTPS2PD(src,dst) 


convert 4 SP data in lower 128 bits of sre to 4 DP data and store the results to dst 


VMERGE(srcl,src2,dst) 


concatenate data in the lower 128 bits in srcl and src2 and store the results to dst 


VUP2L0W(src,dst) 


copy the upper 128 bits in sre to the lower 128 bits in dst 


VANDPS ( sr c 1 , src2 , dst ) 


operate bitwise logical AND on 8 SP data in srcl and src2 and store the results to dst 


VCMPNEQPS (sre 1 , sr c2 , dst ) 


compare 8 SP data in srcl and src2 and the unequal field of dst 


PREFETCH (mem) 


prefetch data on mem to the cache memory by one cache line (64 bytes) 



2. Square Tij >x in YMM03, r^^ in YMM04, and r y , z in 
YMM05, and then sum up them and the squared soften- 
ing length to compute the softened squared distance, 
rfj = rfj + e 2 , and store it into YMM01. 

3. Calculate an inverse square root of r?-, operate 
Newton- Raphson iteration for that value (the block 
N.R. in figure 2), and store the result 1/fij into 
YMM01. If indices of i- and j- particles are the same, 
zero is stored in the corresponding clement of YMM01 
(the block MASK in figure 2). 

4. Multiply 1/fij in YMM01 by rrij in the main mem- 
ory obtaining the gravitational potential exerted by 
j-particles 4nj = rrij/fij, and store the result into 
YMM02. 

5. Calculate the summation of rrij/fij in YMM02 over the 
subscript j (the block SUM in figure 2) , and accumulate 
the results into YMM09 (the block PHI in figure 2). (see 
also figure 5) 

6. Load Vj. x , Vj.y, and Vj lZ to YMM06, YMM07, and YMM08, 
respectively. 

7. Subtract Vi lX , Vi^ y , and Vi iZ in the main memory from 
Vj lX in YMM06, Vj tV in YMM07, and Vj, z in YMM08, and 
store the results (vij >x , Vij tV , and uy,,) into YMM06, 
YMM07, and YMM08, respectively. 

8. Calculate inner product of relative position and rel- 
ative velocity vectors, using in YMM03, rij y in 



YMM04, nj, z in YMM05, V ijtX in YMM06, v ijt y in YMM07, 
and Vij z in YMM08, and store the result (r, 3 • Vij) into 
YMMOO. 

9. Square 1/fij in YMM01, and store the result 1/fij into 
YMM01. 

10. Multiply rrij/fij in YMM02 and ■ v t j in YMMOO by 
1/fij in YMM01, and store the results (rrij/ffj and r„ • 
Vij/rfj) into YMM02 and YMMOO, respectively. 

11. Multiply Tij, x in YMM03, nj lV in YMM04, and r ij:Z in 
YMM05 by rrij/ffj in YMM02 obtaining the acceleration 
vector ay exerted by j-particles, and store the results 
into YMM03, YMM04, and YMM05, respectively. 

12. Calculate the summation of mjTij^/f^j in YMM03, 
m j r ij,y/?ij m YMM04, and rrijTij^/ffj in YMM05 over 
the subscript j (the block SUM in figure 2), and accu- 
mulate the results into in YMM10, YMM11, and YMM12, 
respectively. These correspond to the blocks AX, AY, 
and AZ in figure 2, respectively. 

13. Multiply Vij lX in YMM06, V^ y in YMM07, and Uy^ in 
YMM 08 by irij/ffj in YMM02, obtaining the first term 
of the jerks in equation (5), and store the results into 
YMM06, YMM07, and YMM08, respectively. 

14. Accumulate the first term of the jerks nijVij <x /ffj in 
YMM06, mjVij t y/ffj in YMM07, and TOjUy iZ /f| '• in YMM 08 
into YMM13, YMM14. and YMM15, respectively. 

15. Multiply mjrij, x /ffj in YMM03, mjr^^/f^ in YMM04, 
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(il.jl) 


(UJ2) 





Fig. 3. Data assignment in the YMM registers in figure 2. 



and 



,lrl 



in YMM05 by r i3 ■ Vi 



Irit in YMMOO 



16. 



17. 



. (J ... .... . JJ I/JJ 

obtaining the second term of jerks in equation (5), 
and store the results into YMM03, YMM04, and YMM05, 
respectively. 

Accumulate the second term of jerks m^r^ ■ 
Vij^ij^/ffj in YMM03, m^r^ ■ v^r^.y/r^ in YMM04, 
and nijirij ■ v^r^^/r^ in YMM05 into YMM13, YMM14, 
and YMM15, respectively. 

Return to the step 1 until all the j-particles are pro- 
cessed. 



The calculation of the relative coordinates between two i- 
particles and four j-particles done in the blocks DX, DY, and 
DZ are depicted in figure 4. As an example, the calculation 
of r ij,x are processed as follows. 

1. Load four r hX (j = jO ■ ■ ■ j3) to YMMOO from the main 
memory in DP. 

2. Subtract the coordinate of the first i-particles rio. x in 
the main memory from r^ x in YMMOO, and store the 
result, r l0j >, in YMM01. 

3. Subtract the coordinate of the second z-particle rn. x 
in the main memory from rj tX in YMMOO, and store the 



result. 



in YMM02. 



4. Convert type of r m ^ x in YMM 01 from DP to SP, and 
store the result in the lower bits of YMM01. 

5. Convert type of rnj tX in YMM02 from DP to SP, and 
store the result in the lower bits of YMM02. 

6. Copy Tioj t x in the lower bits of YMM01 to the lower bits 
of YMM03, and raj, x in the lower bits of YMM 02 to the 
upper bits of YMM03. 

Figure 5 illustrates the procedure to accumulate the ac- 
celerations and potentials on two i-particles exerted by four 
j-particles in DP. In the step 2, eight accelerations ay and 
potentials <f>ij between two i-particles and four j-particles 
are calculated on the YMM registers in SP. Note that jerks 
are accumulated in SP, instead of DP. In fact, we cannot 
compute the summations of accelerations and potentials 
over all four j-particles in DP using the AVX instructions. 
Instead, two partial summations over two j-particles can 
be computed for each z-particlc. In accumulating gravita- 
tional potentials, for example, the procedures are given as 
follows. 

1. Convert 4>ioj (j = jO - ■ ■ j3) in the lower 128 bits of 
YMM 02 from SP to DP, and store the result in YMM06. 

2. Copy (f>iij (j = jO ■ ■ ■ j3) in the upper 128 bits of 
YMM 02 to the lower 128 bits of YMM07. 

3. Convert cpnj in the lower 128 bits of YMM07 from SP 
to DP, and store the result in YMM07. 

4. Operate a horizontal sum reduction of YMM06 and 
YMM07, obtaining iO jO + (j) l0 ji, (f>iijo + (f>iiji, <t>i0j2 + 
<t>i0j3, and 4>nj2 + 4>nj3 and store them in YMM06. 

5. Accumulate the partially reduced potentials of i- 



particles with indices of iO and il in YMM06 to cf>i in 
YMM09. 

The function calc_accjerkpot which computes the 
forces on two z-particles from all the j-particles is shown 
in List 3. In the force loop at line 15 in List 3, the accel- 
erations, jerks, and potentials on i-particles are calculated 
as we described. Before entering the force loop, i-particles 
are set, the pointer to j-particles is set to point the first 
element of the array of Jpart, and all the registers are 
set to zero. Once the force loop has been finished, sum 
reductions are operated on the partially accumulated ac- 
celerations, jerks, and potentials, and the results are stored 
in components of the array of structure NewAccJerk. The 
resulting accelerations, jerks, and potentials are multiplied 
by some coefficients. This is because we omit the multipli- 
cation of —1/2 in equation (12) in the force loop. (In this 
source code, we do not correct the bias in the VRSQRTPS 
instruction for simplicity.) 

As seen in List 3, some operands are specified by XMM, 
which are aliases of the XMM registers. In some instruc- 
tions, the lower bits of the YMM register is only used for 
their operands. When such instructions are operated, the 
XMM registers have to be specified. 



List 3. A force loop to calculate 
instructions 



the Newton's force using the AVX 



1 

2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
IS 
16 
17 
IS 
19 
20 
21 
22 
23 
24 
25 
26 
27 
28 
29 
30 
31 
32 
33 
34 
35 
36 
37 
38 
39 
40 
41 
42 



void cal c _ac c j erkpot ( s t ru 
stru 
stru 



double xacc[4] , ya 
float xjrk [8] , yj 
float veps2[8] = {e 
float three [8] = {3 
float threef ourth [8 
struct Ipart *iptr 
struct Jpart *jptr 



ct Ipart *ipart , 
ct Jpart * jpart , 
ct NewAccJerk *accjerk) 

c [4] , zacc [4] , pot [4] ; 
k[8] , zjrk[8] ; 
ps2, eps2 , ... , eps2>; 
, 3.0, ... , 3.0}; 

] = {0 . 75 , 0.75 . 75} ; 

ipart ; 
jpart ; 



// init 
VZER0AL 
for(j 
II 
VL0AD 
VSUBP 
VSUBP 
VCVTP 
VCVTP 
VMERG 
II 
VL0AD 
VSUBP 
VSUBP 
VCVTP 
VCVTP 
VMERG 
// r_ 
VL0AD 
VSUBP 
VSUBP 
VCVTP 
VCVTP 
VMERG 
// (r 
VL0AD 
VMULP 
VADDP 
VMULP 
VADDP 



lize 



; j < nj ; j + = 
,x -> YMM03 
(jptr->xpos [0] 
MCipart->xpos[ 
M C iptr - >xpos [4 
PSCYMM01, XMM0 
PSCYMM02, XMM0 
YMM01 , YMM02 , 
,y -> YMM04 
( jptr->ypos [0] 
M C iptr - >ypos [0 
MCiptri->ypos [ 
PSCYMM01, XMM0 
PS ( YMM02 , XMM0 
YMM01 , YMM02 , 
,z -> YMM05 
(jptr->zpos [0] 
M C iptr - >zpos [0 
M C iptr - >zpos [4 
PSCYMM01, XMM0 
PS ( YMM02 , XMM0 
YMM01 , YMM02 , 
j)~2 -> YMM01 
(veps2 [0] , YMM 
YMM03 , YMM03 , 
YMM00, YMM01, 
YMM04 , YMM04 , 
YMMOO, YMM01, 



, jptr++){ 

YMMOO) ; 
0] , YMMOO , YMM01) ; 
] , YMMOO , YMM02) ; 

1) ; 

2) ; 

YMM03) ; 

, YMMOO); 

] , YMMOO , YMM01) ; 

4] , YMMOO , YMM02) ; 

1) ; 

2) ; 

YMM04) ; 



YMMOO) ; 
] , YMMOO , 
] , YMMOO , 

1) ; 

2) ; 

YMM05) ; 



YMM0 1 ) 
YMM02) ; 



01) ; 

YMMOO) 

YMM01) 

YMMOO) 

YMM01) 



T 



mem 



V 



06 



V 



mem 



V 



07 



V 

j.y 



mem 



mem 



m 



MASK 



01 



N.R. 



DX 



03 



08 



V 



DY 



04 



DZ 



•j.y 



05 



oa 



v 



ILX 



01 



v 



oa 



v 



06 



02 



09 



01 



m/r 

j ij 



1/r 



02 



I 



m/r 3 
i , i i 



SUM 



PHI 



oi j_ 



1/r 



03 



m r /r 3 

j ij.x ij 



04 



mr /r 3 
j M. if 



05 



mr /r 3 

/ M U 



SUM 



10 



AX 



►SUM 



AY 



►SUM 



12 



AZ 



m v /r i 
i m. y 



m (r -v )r /r 

£ ij if ij.x ij 



07 



m v /r 3 
i m g 



'.y 



04 



m (r -v )r /r " 



oa 



mv /r 

j ij-z ij 



7, 



m (r -v )r /r 3 

J ij if ij.z ij 



00 



r -v 

Jf 'M. 



0Q 



r -v /r 

ij ij ij 



Fig. 2. A schematic illustration of the force loop. Numbers at upper left of each box indicate indices of the YMM registers in which data are 
stored, and "mem" indicates the main memory. Data assignment in all the registers are the same, and are drawn in figure 3. 
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jO il 
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Fig. 4. A schematic illustration of the blocks DX, DY, and DZ displayed in figure 2. Numbers at upper left of each box indicate indices of 
the YMM registers in which data are stored, and "mem" indicates the main memory. 



VMULPS (YMM05 , YMM05 , YMM00); 
VADDPS (YMMOO , YMM01, YMM01); 
11-11 r_ij -> YMM01 
VRSQRTPS (YMM01 , YMM02) ; 
VMULPS ( YMM02 , YMM01, YMM01); 
VMULPS (YMM02 , YMM01, YMM01); 
VSUBPS_M(three [0] , YMM01, YMM01); 
VMULPS (YMM02 , YMM01, YMM01); 
// exclude self interaction 
VLOADPS ( jptr->id [0] , YMM02) ; 
VLOADPS (iptr->id [0] , YMMOO); 
VCMPNEQPS (YMMOO , YMM02 , YMM02); 
VANDPS (YMM02 , YMM01, YMM01); 
// phi_i -> YMM09 

VMULPS_M(jptr->mass [0] , YMM01, YMM02); 
VCVTPS2PD CXMM02 , YMMOO); 
VUP2L0W CYMM02 , XMM06); 
VCVTPS2PD CXMM06 , YMM06 ) ; 
VHADDPD CYMM06 , YMMOO, YMM07 ) ; 



62 
63 
64 
65 
66 
67 
68 
69 
70 
71 
72 
73 
74 
75 
76 
77 
78 
79 
80 



VADDPD CYMM07 , 
// v_ij,x, v_ 
VLOADPSCjptr- 
VSUBPS_M(iptr 
VLOADPSCjptr- 
VSUBPS_M(iptr 
VLOADPSCjptr- 
VSUBPS_M(iptr 
// r_ij * v_i 
VMULPS (YMM03 , 
VMULPS CYMM04 , 
VADDPS CYMM02 , 
VMULPS CYMM05 , 
VADDPS CYMM02 , 
// 3.0 * r_i j 
// - m_j / r_ 
VMULPS_M( jptr 
VMULPS CYMM01 , 
VMULPS (YMM01 , 



YMM09 , YMM09) ; 
ij >y> v_ij ,z 
>xvel [0] , YMM06) 
->xvel [0] , 
>yvel [0] , 
->yvel [0] , 
>zvel [0] , 
->zvel [0] , 
j -> YMMOO 



YMM06 . 
YMM07) ; 

YMM07 . 
YMM08) ; 
YMM08 , 



YMMOO) ; 
YMM02) ; 
YMMOO) ; 
YMM02) ; 
YMMOO) ; 
/ Cr_ij)" 
YMM02 
>mass [0] , YMM01 , 
YMM01, YMM01); 
YMMOO, YMMOO); 



YMM06 , 
YMM07 , 
YMMOO , 
YMM08 , 
YMMOO , 
* v - 1 j 
j-3 -> 



YMM06) 
YMM07) 
YMM08) 



YMM02) ; 
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Fig. 5. A schematic illustration of SUM shown in figure 2. Numbers at upper left of each box indicate indices of the YMM registers in which 
data are stored, and "mem" indicates the main memory. 
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07 
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100 
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102 
103 
104 
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107 
108 
100 
110 

111 

112 
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114 
115 
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VMULPS (YMM01 , YMM02 , YMM02); 
VMULPS_M(threefourth[0] , YMM00 , YMM00) ; 
// prefetch 

PREFETCH ( ( jptr + 1) ->xpos [0] ) ; 
PREFETCH ( (jptr + 1) ->zpos [0] ) ; 
PREFETCH ( (jptr + 1) ->mass [0] ) ; 
PREFETCH ( (jptr+1) ->yvel [0] ) ; 

(first term) 
YMM06) ; 
YMM13) ; 
(first term) 
YMM07) ; 
YMM14) ; 
(first term) 
YMM08) ; 
YMM15) ; 



// j_i ,x + = j 
VMULPS ( YMM02 , 
VADDPS (YMM06 , 

// J-i.y + = j 

VMULPS (YMM02 , 
VADDPS (YMM07 , 
// j_i ,z + = j 
VMULPS ( YMM02 , 
VADDPS (YMM08 , 
/ / ax_i , x + = 
VMULPS ( YMM02 , 



-ij i 
YMM06 , 
YMM13 , 

YMM07 , 
YMM14 , 

-ij i 
YMM08 , 
YMM1B , 

ax_i j , 5 
YMM03 , 



YMM03) 



VCVTPS2PD (XMM03 , YMM06) ; 
VUP2LDW(YMM03 , XMM07); 
VCVTPS2PD (XMM07 , YMM07) ; 
VHADDPD (YMM07 , YMM06 , YMM06); 
VADDPD (YMM06 , YMM10, YMM10); 
/ / ay_i , y += ay_i j , y 
VMULPS (YMM02 , YMM04 , YMM04); 
VCVTPS2PD (XMM04 , YMM06); 
VUP2L0W(YMM04 , XMM07) ; 
VCVTPS2PD (XMM07 , YMM07) ; 
VHADDPD (YMM07 , YMM06 , YMM06); 
VADDPD (YMM06 , YMM11, YMM11); 
// az_i,z += az_ij,z 
VMULPS (YMM02 , YMM05 , YMM05); 
VCVTPS2PD (XMM05 , YMM06) ; 
VUP2L0W(YMM05 , XMM07); 
VCVTPS2PD (XMM07 , YMM07) ; 
VHADDPD (YMM07 , YMM06 , YMM06); 



117 
118 
110 
120 
121 
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123 
124 
125 
126 
127 
128 
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134 
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139 
140 
141 
142 
143 
144 
145 
146 
147 
148 
140 
150 
151 
152 



VADDPD (YMM06 
// j_i ,x += 
VMULPS (YMM00 
VSUBPS (YMM03 

// j-i .y += 
VMULPS (YMMOO 
VSUBPS (YMM04 
// j_i,z += 
VMULPS (YMMOO 
VSUBPS (YMM05 



VST0RPD(YMM09 , 
VST0RPD(YMM10 , 
VST0RPDCYMM11 , 
VST0RPD(YMM12 , 
VSTORPD (YMM 13 , 
VST0RPD(YMM14 , 
VST0RPD(YMM15 , 



YMM12 , 
-ij .x ( 
YMM03 , 
YMM13 , 

-ij .y ( 

YMM04 , 
YMM14 , 
-ij .z ( 
YMM05 , 
YMM15 , 



YMM12) ; 
second term) 

YMM03) ; 

YMM13) ; 
second term) 

YMM04) ; 

YMM14) ; 
second term) 

YMM05) ; 

YMM 1 5 ) ; 



pot [0] ) ; 
xacc [0] ) 
yacc [0] ) 
zacc [0] ) 
xjrk[0] ) 
yjrk[0] ) 
zjrk[0]) 



acc j erk [0] 

ac c j erk [ 1] 
acc j erk [1] 
ac c j erk [1] 
acc j erk [1] 
ac c j erk [1] 

ac c j erk [1] 



xacc = 


- 


125 


* 


(xacc [0] 


+ 


xacc [2] ) 


yacc = 


- 


125 




(yacc [0] 


+ 


yacc [2] ) 


zacc = 


- 


125 


* 


(zacc [0] 


+ 


zacc [2] ) 


pot 





5 




(pot [0] 


+ 


pot [2] ) ; 


xjrk = 


- 


125 


* 


(xjrk [0] 


+ 


xjrk [1] 








+ 


xjrk [2] 


+ 


xjrk[3] ) 


yjrk = 


- 


125 


* 


(yjrk[0] 


+ 


yjrk [1] 








+ 


yjrk[2] 


+ 


yjrk[3] ) 


zjrk = 


- 


125 


* 


(zjrk [0] 


+ 


zjrkll] 








+ 


z j rk [2] 


+ 


zjrk[3] ) 


xacc - 


- 


125 


* 


(xacc [1] 


+ 


xacc [3] ) 


yacc = 


- 


125 


* 


(yacc [1] 


+ 


yacc [3] ) 


zacc = 


- 


125 


* 


(zacc [1] 


+ 


zacc [3] ) 


pot 





5 


* 


(pot [1] 


+ 


pot [3] ) ; 


xjrk = 


- 


125 


* 


(xjrk [4] 


+ 


xjrk[5] 








+ 


x j rk [6] 


+ 


xjrk[7] ) 


yjrk = 


- 


125 


* 


(yjrk [4] 


+ 


yjrk[5] 



i n 



153 
154 
155 
156 
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4. Accuracy 

In this section, we present the accuracy of our implemen- 
tation in terms of errors in accelerations, jerks, and poten- 
tials of individual particles for a given snapshot as well as 
errors in the total energy in time integration. 

We measure the errors of accelerations, jerks, and poten- 
tials of individual particles in the Plummer's model with 
N = IK, 4K, and 16K (IK = 1024). Our implementation 
adopts the standard iV-body units, G = M = r v = 1, 
where G is the gravitational constant, M and r v are the 
mass and virial radius of the system, respectively. The soft- 
ening length is set to A/N. The relative errors of accel- 
erations, jerks, and gravitational potentials are defined as 
|o-ODp|/|a|, |j-J D p|/b'l and 10- <t>r>p\/\<t>\, respectively, 
where the reference values, aop, Jrjp an d 0dp are obtained 
by computing fully in DP for all operations, such as com- 
putation of distances, square roots, divisions, etc. 

Figure 6 shows the cumulative distribution of errors of 
accelerations, jerks, and gravitational potentials for indi- 
vidual particles in the Plummer's model with N = IK, 4K, 
and 16K. Most accelerations and potentials have relative 
errors around 10~ 8 . This is quite natural because, in our 
implementation, accelerations and potentials for individual 
pairs of i- and j-particles are calculated in SP. Accuracies 
of jerks are lower than those of accelerations and potentials 
because relative velocity vectors between i- and j-particles 
are calculated in SPs, and their accumulations are also op- 
erated in SP. 

Note that the relative errors of jerks increase with the 
number of particles, which can be clearly seen in the upper 
panel of figure 7. This is caused by the errors in accumu- 
lating jerks of individual pairs of i- and j- particles in SP. 
If we accumulate jerks in DP in the same manner as accel- 
erations and potentials, the distributions of errors become 
almost independent of the number of particle as can be seen 
in the lower panel of figure 7. We speculate that the origin 
of the errors arc round-off errors in accumulating jerks in 
SP. In order to address the effect of errors in jerks, we per- 
form the two runs with jerks accumulated in SP, and in DP 
with various number of particles up to N = 256K. We do 
not find any significant differences in the energy conserva- 
tion throughout the duration of our calculations (t = 0.125 
iV-body units). We thus employ the accumulations of jerks 
in SP, since the accumulation of jerks in DP degrades the 
total performance by 20 - 25 %. However, the accumula- 
tions of jerks in SP could lower the overall accuracy when 
N is large (say N > 10 6 ). 

Figure 8 shows relative error in the total energy conserva- 
tion averaged over the period of 1 /8 crossing time as a func- 
tion of the number of timesteps. The number of timesteps, 
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Fig. 7. Cumulative distribution of relative errors of jerks accumulated 
in SP (upper panel) and in DP (lower panel) for N = IK, 4K, 16K, 
and 64K. 

n s tep, are averaged over all particles during 1 crossing time, 
which corresponds to 2a/2 in the A^-body unit. In order to 
obtain the errors originated only by the force calculation in 
the mixed precisions, we calculate potential energies in DP. 
The errors in the total energy conservation decrease with 
the fourth power of n s t p, since we adopt the fourth-order 
Hermite scheme for the orbit integration of particles. Be- 
cause of the mixed precision for the force calculation, the 
accuracy of the energy conservation stops improving at the 
relative errors in the total energy of 10 -9 even if the num- 
ber of timesteps increases. On the other hand, the accuracy 
of the energy conservation in DP continues to improve for 
larger n st0 p. 

5. Performance 

In this section, we present the performance of our im- 
plementation of the collisional iV-body simulation using 
the AVX instructions. For the measurement of the per- 
formance, we use Intel Core i7-2600 processors with 8MB 
cache memory and a frequency of 3.40 GHz, which contain 
four processor cores. We disable Intel Turbo Boost Tech- 
nology (TB), unless otherwise stated. A compiler which 
we adopt is GCC 4.4.5. We use compilation options -03, 
-f fast -math, and -f unroll-loops. Our code is paral- 
lelized with the Message Passing Interface (MPI) in the 
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Fig. 6. Cumulative distribution of the relative errors of accelerations ("acc"), jerks ("jerk"), and gravitational potentials ("pot") calculated 
in our implementation. The relative errors are calculated for the Plummer's model with TV = IK (left), 4K (middle), and 16K (right). 

Table 3 

Comparison between performances (GFLOPS) with and without 
Intel Turbo Boost Technology (TB) in the cases of 1 MPI process 
on one processor and 8 MPI processes on four processors with the 
Hyper-Threading Technology (HT). 
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Fig. 8. Averaged relative error in the total energy conservation arising 
from the time integration for 1/8 crossing time (0.375 TV-body time 
units) as a function of the number of steps averaged over all particles 
for 1 crossing time (2-\/2 TV-body time units), n s tep, in Plummer's 
model with TV = IK. The force calculations are operated in mixed 
precision (mixed) and DP (double). The dashed line indicates the 
scaling of n~* p . 



same manner as the NINJA scheme described in NMA06, 
and we measure the performance with various numbers of 
MPI processes on a single processors. The code units and 
softening length are the same as those in the previous sec- 
tion. We adopt the Plummer's model for the initial condi- 
tions, and evolve the system from time t = to time t = 1 
in the TV-body units. 

Figure 9 shows the performance of our implementation 
with TV = IK, 2K, 4K, 8K, and 16K, in which the per- 
formance is calculated by estimating the total calculation 
cost of an acceleration, jerk and gravitational potential for 
a pair of i- and j-particles to be 60 floating-point opera- 
tions. Although Makino & Fukushigc (2001) estimated it to 
be 57 floating-point operations, we adopt the estimate by 
NMH06 so that one can directly compare the performance 
of our implementation with that in NMH06. To evaluate 
the performance gain of our implementation with the AVX 
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17.8 


18.2 


18.4 
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1.12 
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1.11 


1.03 


1.03 


1.09 
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1.002 



instructions over that implemented in C-language without 
any explicit use of the SSE and AVX instructions, we mea- 
sure the performance of the C-language code shown in the 
List 3 of NMH06 for V = IK, 2K, and 4K on the same pro- 
cessor. In addition to that, to compare the performance of 
the AVX instructions and the SSE instructions, we also de- 
velop the implementation only with the SSE instructions, 
and measure its performance for TV = IK, 2K, 4K and 8K. 

The performance of our implementation with the AVX 
instructions is ~ 20 GFLOPS and higher than that imple- 
mented with only the SSE instructions (~ 10 GFLOPS) 
by about a factor of two as expected, and 5 times higher 
than that implemented without any SIMD instructions (~ 
4 GFLOPS). One can see that the performance with 4 MPI 
processes on four processor cores reaches ~ 50 GFLOPS 
for N = 1K and - 70 GFLOPS for V = 16K, which arc 2.5 
and 3.5 times higher than those with 1 MPI process on one 
processor core, respectively. When the Hyper- Threading 
Technology (HT) is enabled, the performance with 8 MPI 
processes on four processor cores reaches 90 GFLOPS for 
N = 16K, which is higher than that with 4 MPI processes 
on four processor cores (70 GFLOPS). Note that the per- 
formance with 8 MPI processes on four processor cores 
(50 GFLOPS for TV = 16K) becomes lower than that with 4 
MPI processes on four processor cores when HT is disabled. 

Table 3 shows a comparison between the performance 
when TB is enabled and disabled. TB enables CPU fre- 
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Fig. 9. Performance of our implementation on 1 CPU with N = IK, 
2K, 4K, 8K, and 16K. Open and filled circles indicate the results 
on one and four processor cores, respectively. The results with 1, 
4 and 8 MPI processes ("proc") are shown. In the case of 8 MPI 
processes, Hyper-Threading Technology is enabled ("w/ HT"), and 
disabled ("w/o HT"). The results of the implementation only with 
the SSE instructions and with no SIMD instructions arc also shown. 
The performances of GRAPE-DR and GPU (GTX 470) are plotted 
in dashed lines for comparison. In order to obtain the performance 
of the GPU, Yebisu N-body code (Nitadori, 2009; Spurzem et al., 
2010) is used. We use Nitadori's code for the calculation of "w/o 
SIMD" . His source code for the force calculation is described in List 
3 of NMH06. 

quency to be increased from 3.4 GHz to 3.8 GHz (one ac- 
tive core) and to 3.5 GHz (four active cores); thus TB con- 
tributes to performance gains of 10 % (one active core) and 
3 % (four active cores). As expected, one can see that the 
gains are, respectively, about 10 % and 3 % in the cases of 1 
MPI processes on one processor core and 8 MPI processes 
on four processor cores. 

Note that the performance of our implementation is al- 
most independent of the number of particles. This feature 
is advantageous over the dependences on the number of 
particles by external hardware accelerators such as GPUs 
and GRAPEs. The performances obtained by a GRAPE- 
DR board with 4 GRAPE-DR chips (400 MHz) mounted 
and by a GTX 470 GPU with Yebisu code (Nitadori, 2009; 



Spurzem et al., 2010) are presented in figure 9 in dashed 
lines, which strongly depends on the number of particles 
and relatively lower for the smaller numbers of particles. 
This is due to the overhead in transferring the data between 
extended hardware and host computers. In our implemen- 
tation with the AVX instructions, we do not have such an 
overhead arising from the transfer of particle data, and thus 
the performance is almost independent of the number of 
particles. One can see that the performance of our imple- 
mentation with a single CPU is slightly lower than that of 
the GRAPE-DR and GPU systems for N < 10 4 (by half 
or one third), and even higher for N < 1 x 10 3 . This fea- 
ture is quite desirable for simulations on massively paral- 
lel computers, because the performance independent of the 
number of particles allows us to achieve fairly good "strong 
scaling" . Furthermore, this feature is also advantageous un- 
der wide dynamic range, such as the case of the presence 
of binaries in star clusters. 



6. Summary and discussion 

We have developed an A-body code for a collisional sys- 
tem with the new SIMD instruction set available on the 
Sandy Bridge architecture released by Intel Corporation, 
the AVX instruction set. The performance is ~ 20 GFLOPS 
with one processor core, which is two times higher than 
that with the SSE instructions as expected, and five times 
higher than that in C-language without any explicit use of 
SIMD instructions. The performance on a single CPU with 
four processor cores reaches 90 GFLOPS. 

Here, let us estimate the performance of our code paral- 
lelized with the NINJA scheme on massively parallel sys- 
tems. As a reference, NMA06 achieved about 4 GFLOPS 
with one processor core in a dual-core Opteron proces- 
sors, and 2.03 tera FLOPS (TFLOPS) with 400 dual-core 
Opteron processors for an N — 64K system using the 
NINJA scheme. Since we have achieved 20 GFLOPS with 
one core on a Intel Core i7-2600 processor, the performance 
of 10 TFLOPS can be achieved with a massively parallel 
system with the same number of processor cores of Core-i7 
2600 processors. 

There are two advantages of our approach to acceler- 
ate A-body simulations using SIMD instructions imple- 
mented on CPUs over the use of external hardwares such 
as GPUs and GRAPE families. One is the portability of 
our code. We can carry out A-body simulations with good 
performance on any computer systems without GPUs and 
GRAPE boards. Even on the systems with instruction set 
architectures other than the x86 architecture, the fact that 
SIMD instruction sets similar to the SSE and AVX instruc- 
tion set are available (e.g. Vector Multimedia Extension on 
Power series, HPC-ACE on SPARC64 VHIfx, etc.) suggests 
that our approach to use SIMD instructions is successful on 
any architecture. The second advantage is the fact that the 
performance of our implementation is almost independent 
of the number of particles because there is no overhead as- 



sociatc with the communication between a CPU and an ex- 
ternal hardware, such as a GRAPE and a GPU. As a result, 
the performance of our implementation is better than that 
aided by external hardwares. In other words, our code have 
very good strong scaling on massively parallel systems. 

Finally, let us mention the future improvement of our 
implementation. The performance of our TV-body code will 
be higher when Fused Multiply- Add (FMA) instructions 
are introduced. Using the FMA instructions which com- 
pute the product of two numbers and add the result to an- 
other number in one step, we will be able to improve the 
performance and the accuracy of the code, especially accu- 
mulation of jerks in the force loop. The FMA instructions 
will be available in the next-generation CPU named " Bull- 
dozer" by AMD Corporation, and " Haswell" by Intel Cor- 
poration, which are scheduled to be released in mid-2011, 
and in 2013, respectively. Our code is publicly available at 
http:/ /code. google. com/p/phantom-grape/. 
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